Advent of Code Stats update 2022

What can we learn from the stats?

  • toc: false
  • branch: master
  • badges: true
  • comments: true
  • author: Jesse van Elteren
  • image: images/aocstats.png
  • categories: []

Update Dec 25 2022 with 2022 scores

Had a great time again, thanks Eric. And I got on the global leaderboard once, so I can die in peace now :)

I've been participating in 🎅Advent of Code🎅 since 2018 and enjoying it a great deal. Since AoC has been running since 2015, a sizable amount of data has been generated. Let's see what we can learn, starting with the number of stars awarded each season.

# hide
import numpy as np
import altair as alt
import pandas as pd
import database as db
alt.data_transformers.disable_max_rows()
def sfmono():
    font = "Arial"  # also tried "SF Mono" and "Lato"
    fontsize = 16
    
    return {'height': 300, 'width': 450,
        "config" : {
             "title": {'font': font, 'titleFontSize':100},
             "axis": {
                  "labelFont": font,
                  "labelFontSize": fontsize,
                  "titleFont": font,
                  "titleFontSize": fontsize,
             },
             "header": {
                  "labelFont": font,
                  "labelFontSize": fontsize,
                  "titleFont": font,
                  "titleFontSize": fontsize,
             },
             "legend": {
                  "labelFont": font,
                  "labelFontSize": fontsize,
                  "titleFont": font,
                  "titleFontSize": fontsize,
                  
             },
             
        }
    }
alt.themes.register('sfmono', sfmono)
alt.themes.enable('sfmono')
conn = db.open_db('aoc.db')
# no points for 2020 day 1 and 2018 day 6
db.do(conn, "UPDATE scores SET points = 101 - position")
db.do(conn, "UPDATE scores SET points = null where year = 2020 and day = 1")
db.do(conn, "UPDATE scores SET points = null where year = 2018 and day = 6")

# hide 
sql = """
    SELECT year, ROUND((SUM(both) * 2 + SUM(first)) * 1.0 / 100000, 2) as stars_100k
    FROM finishers
    GROUP BY 1
"""
# -- SELECT * FROM scores LIMIT 5
data = db.do(conn, sql)
df = pd.DataFrame([{'year': year, 'image':'star'} for year, count in data for _ in range(int(count))])
# hide_input

alt.Chart(df).mark_text(size=45, baseline='middle').encode(
    alt.X('x:O',  axis=alt.Axis(labels=False, ticks=False,title='each star represents 100k stars handed out')),
    alt.Y('image:O', axis=None),
    alt.Row('year:N', header=alt.Header(title='')),
    alt.Text('emoji:N')
).configure_view(strokeOpacity=0).transform_calculate(
    emoji="{'star': '⭐'}[datum.image]"
).transform_window(
    x='rank()',
    groupby=['year', 'image']
).properties(height=10)

In total, more than 14M stars have now been awarded! After a ‘rough’ 2016, AoC has been growing steadily, with 2020 delivering a (Covid?) 100% boost. And 2022 has only just finished; many people will earn stars in the days to come.

Edit: as BBQspaceflight pointed out on the AoC Reddit, 2016 was probably not a rough year; rather, many people solved 2015 at a later time (e.g. they participated in 2018 and did 2015 afterwards).

# hide_input
df = db.do_df(conn, 'SELECT * FROM scores WHERE star = 2 ORDER BY time')
df['puzzle'] = df['year'].astype(str) + ' - ' + df['day'].astype(str)
df['time'] /= 60

stripplot =  alt.Chart(df, width=40).mark_circle(size=8).encode(
    x=alt.X(
        'jitter:Q',
        title=None,
        axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False),
        scale=alt.Scale(),
    ),
    y=alt.Y('time:Q', title=None, axis=alt.Axis(tickCount= 2, values=[0, 60,])),
    # color=alt.Color('puzzle:N', legend=None),
    column=alt.Column(
        'day:N',
        header=alt.Header(
            labelAngle=-90,
            titleOrient='bottom',
            labelOrient='bottom',
            labelAlign='right',
            labelPadding=3,
        ),
    ),
    row=alt.Row('year:N', title=None, spacing=10)
).transform_calculate(
    # Generate Gaussian jitter with a Box-Muller transform
    jitter='sqrt(-2*log(random()))*cos(2*PI*random())'
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
).properties(
    width=18,
    height=120
).configure_mark(
    color='darkgreen'
)

stripplot

For a better plot, look at Maurits vd Schee. I only plotted times for the full day (second star). It’s a familiar sight: in general the times are below the hour mark, with a couple going higher. After 2020 the completion times are more compressed. This year’s most difficult puzzles were 16 (Proboscidea Volcanium, tunnels with valves), 19 (Not Enough Minerals, collect geodes) and 22 (it’s a cube?).

Low completion times can be the result of two factors:

* The puzzles were easier
* The participants were better / more competitive

One way to investigate the difficulty of a year is to analyze the completion rate: how many people got all the stars, compared to the people that got only the first star of day 25. The latter group did make it to day 25, and thus put in a considerable amount of effort, but couldn’t finish all the puzzles.

I didn’t want to count all participants from day 1, since that number drops quickly during the first couple of days.
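As a quick sanity check on the metric, here is the completion-rate arithmetic with made-up numbers (the real counts come from the `finishers` table below):

```python
# Hypothetical counts, for illustration only.
both_stars = 10_000   # reached day 25 with every star collected
first_only = 2_000    # reached day 25 but missed stars along the way

# Completion rate as defined above: finishers vs. everyone who made it to day 25.
completion_rate = both_stars * 100 / (both_stars + first_only)
print(round(completion_rate))  # 83
```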

# hide
from pprint import pprint
sql = """
    SELECT f1.year, f1.both, f1.first as only_first, ROUND(f1.both * 100.0 / (f1.first + f1.both), 0) as perc_day_25_both
    FROM finishers f1
    WHERE day = 25
    

"""
# -- SELECT * FROM scores LIMIT 5
df = db.do_df(conn, sql)
df['x1'] = df['both'].cumsum()
df['x0'] = df['x1'].shift(fill_value=0)
# hide_input
alt.Chart(df).mark_rect().encode(
    alt.X('x0', title='# participants finishing all puzzles', axis=alt.Axis(labelFontSize=16, format='.2s')),
    alt.X2('x1'),
    alt.Y('perc_day_25_both:Q', title='% day 25 participants obtaining all stars', axis=alt.Axis(labelFontSize=10)),
    color=alt.Color('year:O', scale=alt.Scale(scheme='greens'))
)

In the chart above, each rectangle represents the people that solved all puzzles in a given year: the width shows how many people that was, and the height shows the completion rate.

The completion rate was very high in 2016 and 2017 and lowest in 2018. In 2020 many people finished all puzzles, corresponding with more participants that year.

For 2022 the jury is still out: in a couple of months people will have had time to finish, and the completion rate will stabilize. I find it interesting that it’s already higher than 2021’s.

# hide_input
sql = """
WITH year_scores AS (
        SELECT year, user, SUM(points) as total
        FROM scores
        GROUP BY 1,2
        ORDER BY 3 DESC),
    
    started AS(
        SELECT year, both
        FROM finishers
        WHERE day = 25
        GROUP BY 1
    )

SELECT y.year, COUNT(y.year) as 'participants with points', s.both as 'participants finishing all puzzles'
FROM year_scores y
JOIN started s
ON s.year = y.year
GROUP BY y.year
"""
df = db.do_df(conn, sql)
df['% Finishers that got points'] = (df['participants with points'] / df['participants finishing all puzzles'] * 100).astype(int)

descriptives = alt.Chart(df).mark_bar().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y('year:O', title=None)
).properties(width=100, height=200).repeat(
    column=['participants finishing all puzzles', 'participants with points', '% Finishers that got points']
).configure_mark(
    color='darkgreen'
)
descriptives

Again note that 2022 is very fresh still.

We see that:

* The number of finishers (people getting all stars) peaked in 2020
* The number of people that got points on the leaderboard is slowly increasing, with 2015 also being very high. Why?
* The percentage of finishers getting points varies. Lower percentages could indicate how competitive the year was. Another indicator is the time it took to solve a puzzle.

# hide_input
sql = """
WITH totaltime AS(
        SELECT year, position, ROUND(SUM(time) * 1.0 / 3600, 1) as total
        FROM scores
        WHERE (position = 1 or position = 100) and star = 2
        GROUP BY 1,2
    )

SELECT t1.year, t1.position, t1.total as 'total_time (hours)'
FROM totaltime t1
GROUP BY 1,2
"""
times = db.do_df(conn, sql)

chart = alt.Chart(times).mark_line(color='#00cc00').encode(
    x=alt.X('total_time (hours):Q', title='Total time to complete all 25 days'),
    y='year:N',
    detail='year:N'
)
# Add points for positions #1 and #100
points = chart.mark_point(
    size=100,
    opacity=1,
    filled=True
).encode(
    x='total_time (hours):Q',
    y='year:N',
    color=alt.Color('position:O',
        scale=alt.Scale(
            domain=['1','100'],
            range=['#00cc00', '#009900']
        )
    )
)

chart + points 

The fastest completion times add up to around 3 hours, which is amazing. Since nobody has ever finished #1 on every puzzle, this is a theoretical minimum.

The completion times of #100 add up to a more ‘human’ amount, but these times are still way below what a ‘normal’ participant spends on AoC. For example, I consider myself an enthusiast, but my completion times are normally about 2-3x those of the #100.

2022 was very similar compared to 2021.

# hide_input
df2 = df.merge(times.loc[times['position']==100])
points = alt.Chart(df2).mark_point(color='darkgreen').encode(
    x=alt.X('participants finishing all puzzles:Q', scale=alt.Scale(type='linear')),
    y=alt.Y('total_time (hours):Q', scale=alt.Scale(type='linear'), title='Total time #100')
)

text = points.mark_text(
    align='left',
    baseline='middle',
    dx=7
).encode(
    text='year'
)

points + text

Here is where things get hazy: there seems to be some kind of relation between the percentage of participants that get points and the total time taken by the leaderboard #100, summed over all days. Just leaving this here for your imagination.

2022 will move further to the right given time. There does seem to be some correlation between the total time the #100 took and the number of participants finishing all puzzles, although I’m not sure it’s causal; other variables could play a role.
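To put a number on that apparent relation, one could compute a correlation coefficient. A minimal sketch with made-up illustrative values (the real ones live in the merged `df2` above):

```python
import numpy as np
import pandas as pd

# Illustrative values only -- substitute the merged df2 from the cell above.
df2_sketch = pd.DataFrame({
    'finishers':        [500, 3600, 3100, 4400, 6600, 18100, 12300, 9500],
    'total_time_hours': [40.1, 30.5, 28.2, 25.9, 22.4, 18.7, 17.9, 17.5],
})

# Pearson correlation between #finishers and the #100's summed solve time.
r = np.corrcoef(df2_sketch['finishers'], df2_sketch['total_time_hours'])[0, 1]
# A strongly negative r would support "more finishers, faster #100".
```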

Getting leaderboard points is special (I made it once and got 39 points, mission complete!!!). But there are people who do it consistently. Let’s give the top 30 some extra recognition🎈

# hide_input
sql = """
WITH year_scores AS (
    SELECT year, user, SUM(points) as points,SUM(time) as time, COUNT(user) as counts
    FROM scores
    GROUP BY 1,2
    ORDER BY 3 DESC)

SELECT  user, 
        COUNT(user) as amount_seasons, 
        CAST(SUM(ifnull(points,0)) AS INT) as total_points, 
        SUM(counts) as total_leaderboard_placements, 
        CAST(AVG(counts) AS INT) as average_amount_lb_placements, 
        CAST(AVG(points) AS INT) as avg_points_season
FROM year_scores
GROUP BY 1
ORDER BY 3 DESC
"""
df = db.do_df(conn, sql)

alt.Chart(df[:30]).mark_point(color='darkgreen').encode(
    alt.Y('user:N', sort=['total_points']),
    alt.X(alt.repeat("column"), type='quantitative', axis=alt.Axis(labelFontSize=15)),
    
).properties(
    width=175,
    height=450
).repeat(
    # alt.Y('user:N', sort=['total_points']),
    column=['total_points', 'total_leaderboard_placements'],
)
# hide_input
alt.Chart(df[:30]).mark_point(color='darkgreen').encode(
    alt.Y('user:N', sort=['total_points']),
    alt.X(alt.repeat("column"), type='quantitative'),
    
).properties(
    width=175,
    height=450
).repeat(
    # alt.Y('user:N', sort=['total_points']),
    column=['average_amount_lb_placements', 'amount_seasons'],
)

All the people on this top-30 list are amazing, but some awards to hand out:

* 🏆 Robert Xiao managed to score the most points and the most leaderboard placements overall
* 🏆 betaveros got the most points & leaderboard entries per season on average (ignoring the anonymous user here). betaveros also managed to get 50 entries in 2018, which was a one-time event
* 🏆 glguy got the highest score while earning points in all 8 seasons

Doing AoC once and getting leaderboard points is nice, but it’s even nicer to do it twice, thrice, etc.

# hide_input
alt.Chart(df).mark_bar().encode(
    alt.X('amount_seasons:N', title= '# seasons to get points'),
    alt.Y('count()', title='amount',scale=alt.Scale(type='log'))
).configure_mark(
    color='darkgreen'
)

Most of the people that get points manage to do it only once (note that the y-axis is logarithmic). Who has the grit to get points in all 8 seasons?

# hide_input
df.loc[df['amount_seasons']==8].sort_values('avg_points_season', ascending=False).loc[:,'user':'total_leaderboard_placements']
| user | amount_seasons | total_points | total_leaderboard_placements |
|---|---|---|---|
| glguy | 8 | 11588 | 184 |
| etotheipi1 | 8 | 10488 | 191 |
| msullivan | 8 | 9219 | 162 |
| Kevin Yap | 8 | 8820 | 154 |
| (anonymous user #60233) | 8 | 6383 | 128 |
| Roderic Day | 8 | 3719 | 75 |
| lukechampine | 8 | 3352 | 75 |

We lost Daniel Rejment and Shane Mc Cormack compared to last year, and glguy tops the list. Coming back to our competitiveness discussion: how many points did they score together?

# hide_input
sql = """
WITH year_scores AS (
    SELECT year, user, SUM(points) as total
    FROM scores
    GROUP BY 1,2
    ORDER BY 3 DESC),
    
toppers AS (
    SELECT user
    FROM year_scores
    GROUP BY 1
    HAVING count(user) > 7)

SELECT s.year, s.user, s.total as points
FROM toppers as t
JOIN year_scores as s
ON s.user = t.user

"""
# -- SELECT * FROM scores LIMIT 5
df = db.do_df(conn, sql)
df['user'].value_counts()

alt.Chart(df).mark_bar(opacity=0.8).encode(
    x='year:N',
    y='sum(points)',
    color=alt.Color('user')
).properties(
    width=400,
    height=300)

It varies, but 2020, 2021 and 2022 are lower. This could reflect:

* increased competitiveness over the years
* natural variation
* legends getting older😊

All in all an amazing achievement! And I think there is a strong case for AoC being more competitive after 2019.

Let’s finally turn to which puzzles were easiest or hardest.

# hide_input
sql = """
    SELECT year || '-' || day || '-' || star as puzzle, user, time as "time (seconds)"
    FROM scores
    WHERE position=1
    ORDER BY 3
"""
# -- SELECT * FROM scores LIMIT 5
df = db.do_df(conn, sql)
sql = """
    SELECT year || '-' || day || '-' || star as puzzle, time as "lb full (seconds)"
    FROM scores
    WHERE position=100
    GROUP BY 1
    ORDER BY 2
"""
# -- SELECT * FROM scores LIMIT 5
df_100 = db.do_df(conn, sql)
df[:5]
| puzzle | user | time (seconds) |
|---|---|---|
| 2022-3-1 | ostwilkens | 10 |
| 2022-4-1 | max-sixty | 16 |
| 2019-1-1 | bluepichu | 23 |
| 2018-1-1 | Tris Emmy Wilson | 26 |
| 2021-1-1 | betaveros | 28 |

Last year we had a record of 23 seconds, but thanks to ChatGPT we have 2 new ‘easiest’ puzzles.

# hide_input
df_100[:5]
| puzzle | lb full (seconds) |
|---|---|
| 2021-1-1 | 65 |
| 2022-1-1 | 76 |
| 2019-1-1 | 84 |
| 2018-1-1 | 92 |
| 2021-2-1 | 98 |

The leaderboard capped (i.e. the #100 completed the puzzle) after barely a minute in 2021 for the first star!

# hide_input
df = df.loc[df['puzzle'].str.endswith('2'),:].copy()
df['time (seconds)'] = (df['time (seconds)'] / 60).astype(int)
df = df.rename(columns={'time (seconds)':'time (minutes)'})
df = df.sort_values('time (minutes)', ascending=False)
df['title'] = 'Title'
df.iloc[:10,3] = ['Beverage Bandits', 'Reservoir Research', 'Immune System Simulator 20XX', 'Monkey Map', 'Jurassic Jigsaw', 'Not Enough Minerals', 'Amphipod', 'Many-Worlds Interpretation', 'Wizard Simulator 20XX', 'Beacon Scanner']
df[:10]

# Beverage Bandits, Reservoir Research, Immune System Simulator 20XX, Jurassic Jigsaw, Amphipod, Many-Worlds Interpretation, Beacon Scanner, Wizard Simulator 20XX, Coprocessor Conflagration, Flawed Frequency Transmission
| puzzle | user | time (minutes) | title |
|---|---|---|---|
| 2018-15-2 | Simon Parent | 36 | Beverage Bandits |
| 2018-17-2 | Raven Black | 33 | Reservoir Research |
| 2018-24-2 | Simon Parent | 28 | Immune System Simulator 20XX |
| 2022-22-2 | mrphlip | 25 | Monkey Map |
| 2020-20-2 | xiaowuc1 | 25 | Jurassic Jigsaw |
| 2022-19-2 | lukechampine | 24 | Not Enough Minerals |
| 2021-23-2 | goffrie | 23 | Amphipod |
| 2019-18-2 | glguy | 22 | Many-Worlds Interpretation |
| 2015-22-2 | Paul Hankin | 21 | Wizard Simulator 20XX |
| 2021-19-2 | ecnerwala | 21 | Beacon Scanner |

The 3 longest solve times were all in 2018! Shoutout to Simon Parent for solving 2 out of the top 3. This list mostly contains puzzles that simply take a long time to code, with Beverage Bandits as the perfect example.

# hide_input
df_100 = df_100.loc[df_100['puzzle'].str.endswith('2'),:].copy()
df_100['lb full (seconds)'] = (df_100['lb full (seconds)'] / 60).astype(int)
df_100 = df_100.rename(columns={'lb full (seconds)':'lb full (minutes)'})
df_100 = df_100.sort_values('lb full (minutes)', ascending=False)
df_100['title'] = 'Title'
df_100.iloc[:10,2] = ['Medicine for Rudolph', 'Not Quite Lisp', 'Wizard Simulator 20XX', 'Radioisotope Thermoelectric Generators', 'Beverage Bandits', 'Slam Shuffle', 'Many-Worlds Interpretation', 'Experimental Emergency Teleportation', 'Grid Computing', 'Immune System Simulator 20XX']
df_100[:10]
| puzzle | lb full (minutes) | title |
|---|---|---|
| 2015-19-2 | 232 | Medicine for Rudolph |
| 2015-1-2 | 186 | Not Quite Lisp |
| 2015-22-2 | 183 | Wizard Simulator 20XX |
| 2016-11-2 | 164 | Radioisotope Thermoelectric Generators |
| 2018-15-2 | 143 | Beverage Bandits |
| 2019-22-2 | 123 | Slam Shuffle |
| 2019-18-2 | 117 | Many-Worlds Interpretation |
| 2018-23-2 | 100 | Experimental Emergency Teleportation |
| 2016-22-2 | 88 | Grid Computing |
| 2018-24-2 | 87 | Immune System Simulator 20XX |

Looking at when the leaderboard capped, some different puzzles show up. I feel this list contains more algorithmic challenges (Slam Shuffle, for example, but Medicine for Rudolph as well). 2015 also shows up in the top 3. If these puzzles had been recycled in 2021, they would have been solved faster.

Overall, I feel 2018 is a strong contender for the most difficult year, with day 15 as the most difficult puzzle in the history of Advent of Code!

Hope you enjoyed this analysis and see you back next year! 🎄⭐🎅